library(tidyverse) # for graphing and data cleaning
library(googlesheets4) # for reading googlesheet data
library(lubridate) # for date manipulation
library(ggthemes) # for even more plotting themes
library(geofacet) # for special faceting with US map layout
gs4_deauth() # To not have to authorize each time you knit.
theme_set(theme_minimal()) # My favorite ggplot() theme :)
#Lisa's garden data
garden_harvest <- read_sheet("https://docs.google.com/spreadsheets/d/1DekSazCzKqPS2jnGhKue7tLxRU3GVL1oxi-4bEM5IWw/edit?usp=sharing") %>%
mutate(date = ymd(date))
# Seeds/plants (and other garden supply) costs
supply_costs <- read_sheet("https://docs.google.com/spreadsheets/d/1dPVHwZgR9BxpigbHLnA0U99TtVHHQtUzNB9UR0wvb7o/edit?usp=sharing",
col_types = "ccccnn")
# Planting dates and locations
plant_date_loc <- read_sheet("https://docs.google.com/spreadsheets/d/11YH0NtXQTncQbUse5wOsTtLSKAiNogjUA21jnX5Pnl4/edit?usp=sharing",
col_types = "cccnDlc")%>%
mutate(date = ymd(date))
# Tidy Tuesday data
kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
Before starting your assignment, you need to get yourself set up on GitHub and make sure GitHub is connected to R Studio. To do that, you should read the instruction (through the “Cloning a repo” section) and watch the video here. Then, do the following (if you get stuck on a step, don’t worry, I will help! You can always get started on the homework and we can figure out the GitHub piece later):
keep_md: TRUE in the YAML heading. The .md file is a markdown (NOT R Markdown) file that is an interim step to creating the html file. They are displayed fairly nicely in GitHub, so we want to keep it and look at it there. Click the boxes next to these two files, commit changes (remember to include a commit message), and push them (green up arrow).Put your name at the top of the document.
For ALL graphs, you should include appropriate labels.
Feel free to change the default theme, which I currently have set to theme_minimal().
Use good coding practice. Read the short sections on good code with pipes and ggplot2. This is part of your grade!
When you are finished with ALL the exercises, uncomment the options at the top so your document looks nicer. Don’t do it before then, or else you might miss some important warnings and messages.
These exercises will reiterate what you learned in the “Expanding the data wrangling toolkit” tutorial. If you haven’t gone through the tutorial yet, you should do that first.
garden_harvest data to find the total harvest weight in pounds for each vegetable and day of week. Display the results so that the vegetables are rows but the days of the week are columns.garden_harvest %>%
group_by(vegetable, date) %>%
summarize(weight_lbs = weight * 0.00220462, day_of_week = weekdays(date)) %>%
group_by(vegetable, day_of_week) %>%
summarize(total_harvest_weight_lbs = sum(weight_lbs)) %>%
pivot_wider(id_cols = vegetable,
names_from = day_of_week,
values_from = total_harvest_weight_lbs)
garden_harvest data to find the total harvest in pound for each vegetable variety and then try adding the plot variable from the plant_date_loc table. This will not turn out perfectly. What is the problem? How might you fix it?garden_harvest %>%
group_by(vegetable, variety) %>%
summarize(total_harvest_lbs = sum(weight) * 0.00220462) %>%
left_join(plant_date_loc,
by = c("vegetable", "variety"))
There are multiple missing values in the ‘plot’ variable for various vegetables/varieties. I might fix this by recoding the missing values of ‘plot’ to a different value or excluding the missing values entirely.
garden_harvest and supply_cost datasets, along with data from somewhere like this to answer this question. You can answer this in words, referencing various join functions. You don’t need R code but could provide some if it’s helpful.In order to get a better understanding of how much money you saved by gardening rather than buying your vegetables from the supermarket, you would first need to calculate how much you spent on supplies for your garden. Using the 'supply_cost' dataset, you could sum up the *price_with_tax* for each vegetable and variety you purchased to then plant in your garden. Then, you could calculate your total yield for each vegetable and variety, using the 'garden_harvest' dataset. It would be most helpful to calculate the total harvest for each vegetable variety in pounds, like I did in Problem 2. Then, you could calculate how much it would cost to purchase equivalent amounts of each vegetable variety from a supermarket, like Whole Foods. Finally, to see how much money you saved by gardening, subtract your total supply cost from the total supermarket cost of your vegetables.
garden_harvest %>%
filter(vegetable == "tomatoes") %>%
group_by(variety) %>%
summarize(first_harvest_date = min(date), total_harvest_lbs = sum(weight) * 0.00220462) %>%
arrange(first_harvest_date) -> ordered_by_date
ordered_by_date
ordered_by_date %>%
ggplot(aes(y = fct_relevel(variety, "grape", "Big Beef", "Bonny Best", "Better Boy", "Cherokee Purple", "Amish Paste", "Mortgage Lifter", "Jet Star", "Old German", "Black Krim", "Brandywine", "volunteers"), x = total_harvest_lbs)) +
geom_col(fill = "darkred") +
labs(title = "Total Harvest Weight of Different Tomato Varieties (Ordered by First Harvest Date)", x = "Total Harvest Weight (lbs)", y = "Variety")
garden_harvest data, create two new variables: one that makes the varieties lowercase and another that finds the length of the variety name. Arrange the data by vegetable and length of variety name (smallest to largest), with one row for each vegetable variety. HINT: use str_to_lower(), str_length(), and distinct().garden_harvest %>%
distinct(variety) %>%
mutate(variety_lower = str_to_lower(variety), variety_length = str_length(variety)) %>%
arrange(variety_length)
garden_harvest data, find all distinct vegetable varieties that have “er” or “ar” in their name. HINT: str_detect() with an “or” statement (use the | for “or”) and distinct().garden_harvest %>%
select(variety) %>%
distinct() %>%
mutate(has_er_ar = str_detect(variety, "er|ar"))
In this activity, you’ll examine some factors that may influence the use of bicycles in a bike-renting program. The data come from Washington, DC and cover the last quarter of 2014.
{300px}
{300px}
Two data tables are available:
Trips contains records of individual rentalsStations gives the locations of the bike rental stationsHere is the code to read in the data. We do this a little differently than usualy, which is why it is included here rather than at the top of this file. To avoid repeatedly re-reading the files, start the data import chunk with {r cache = TRUE} rather than the usual {r}.
data_site <-
"https://www.macalester.edu/~dshuman1/data/112/2014-Q4-Trips-History-Data.rds"
Trips <- readRDS(gzcon(url(data_site)))
Stations<-read_csv("http://www.macalester.edu/~dshuman1/data/112/DC-Stations.csv")
NOTE: The Trips data table is a random subset of 10,000 trips from the full quarterly data. Start with this small data table to develop your analysis commands. When you have this working well, you should access the full data set of more than 600,000 events by removing -Small from the name of the data_site.
It’s natural to expect that bikes are rented more at some times of day, some days of the week, some months of the year than others. The variable sdate gives the time (including the date) that the rental started. Make the following plots and interpret them:
sdate. Use geom_density().Trips %>%
ggplot(aes(x = sdate)) +
geom_density(fill = "blue") +
labs(title = "Start Date/Time vs. Events", x = "Start Date/Time of Rental", y = "Events")
This density plot shows that events were most frequent in October and November, then decreased going into December and January. This presumably due to the fact that the weather in October and November is warmer and more conducive to riding a bike.
mutate() with lubridate’s hour() and minute() functions to extract the hour of the day and minute within the hour from sdate. Hint: A minute is 1/60 of an hour, so create a variable where 3:30 is 3.5 and 3:45 is 3.75.Trips %>%
mutate(hour_of_day = hour(sdate), minute_of_day = minute(sdate), time_of_day = hour_of_day + minute_of_day * 1/60) %>%
ggplot(aes(x = time_of_day)) +
geom_density(fill = "blue") +
labs(title = "Time of Day vs. Events", x = "Time of Day", y = "Events")
Events are most frequent around 8:00 AM and 6:00 PM. Between these two times events are fairly frequent. For times before 8:00 AM or after 6:00 PM, event frequency drops off sharply.
Trips %>%
mutate(day_of_week = weekdays(sdate)) %>%
ggplot(aes(y = day_of_week)) +
geom_bar(fill = "blue") +
labs(title = "Events vs. Day of the Week", x = "Events", y = "Day of the Week")
The number of events is highest during weekdays and drops off slightly during the weekends. The number of events is fairly even for each week day. Saturday and Sunday also both have similar numbers of events.
Trips %>%
mutate(hour_of_day = hour(sdate), minute_of_day = minute(sdate), time_of_day = hour_of_day + minute_of_day * 1/60) %>%
mutate(day_of_week = weekdays(sdate)) %>%
ggplot(aes(x = time_of_day)) +
geom_density(fill = "blue") +
labs(title = "Time of Day vs. Events for Each Day of the Week", x = "Time of Day", y = "Events") +
facet_wrap(~day_of_week)
After faceting by day of the week, a pattern is noticeable. On weekdays, events are most frequent around 8:00 AM and 6:00 PM. On the weekends, events are most frequent around 2:00 PM.
The variable client describes whether the renter is a regular user (level Registered) or has not joined the bike-rental organization (Causal). The next set of exercises investigate whether these two different categories of users show different rental behavior and how client interacts with the patterns you found in the previous exercises. Repeat the graphic from Exercise @ref(exr:exr-temp) (d) with the following changes:
fill aesthetic for geom_density() to the client variable. You should also set alpha = .5 for transparency and color=NA to suppress the outline of the density function.Trips %>%
mutate(hour_of_day = hour(sdate), minute_of_day = minute(sdate), time_of_day = hour_of_day + minute_of_day * 1/60) %>%
mutate(day_of_week = weekdays(sdate)) %>%
ggplot(aes(x = time_of_day, fill = client)) +
geom_density(alpha = .5, color = NA) +
labs(title = "Time of Day vs. Events for Each Day of the Week", x = "Time of Day", y = "Events") +
facet_wrap(~day_of_week)
On most days of the week, the Casual and Registered users show different rental behavior. Casual users most frequently rent bikes at around 12:00 PM, while Registered users most frequently rent bikes around 8:00 AM and 6:00 PM.
position = position_stack() to geom_density(). In your opinion, is this better or worse in terms of telling a story? What are the advantages/disadvantages of each?Trips %>%
mutate(hour_of_day = hour(sdate), minute_of_day = minute(sdate), time_of_day = hour_of_day + minute_of_day * 1/60) %>%
mutate(day_of_week = weekdays(sdate)) %>%
ggplot(aes(x = time_of_day, fill = client)) +
geom_density(alpha = .5, color = NA, position = position_stack()) +
labs(title = "Time of Day vs. Events for Each Day of the Week", x = "Time of Day", y = "Events") +
facet_wrap(~day_of_week)
I believe that adding the argument position = position_stack() to geom_density()is worse in terms of telling a story than the previous plot. By stacking the density plots, it is more difficult to see the differences in the behaviors of Casual and Registered users. In the first plot, where the density plots were allowed to overlap, the similarities between Casual and Registered user behavior were apparent because these areas of the plot were a distinct “third” color, and the differences in behaviors were easier to identify immediately. An advantage of stacking the density plots as done in the plot directly above is that you are able to compare the overall number of events by Casual and Registered users more easily.
weekend which will be “weekend” if the day is Saturday or Sunday and “weekday” otherwise (HINT: use the ifelse() function and the wday() function from lubridate). Then, update the graph from the previous problem by faceting on the new weekend variable.Trips %>%
mutate(day_of_week = weekdays(sdate), weekend = ifelse(day_of_week == "Sunday"| day_of_week == "Saturday", "Weekend", "Weekday")) %>%
mutate(hour_of_day = hour(sdate), minute_of_day = minute(sdate), time_of_day = hour_of_day + minute_of_day * 1/60) %>%
ggplot(aes(x = time_of_day, fill = client)) +
geom_density(alpha = .5, color = NA) +
labs(title = "Time of Day vs. Events for Weekdays and Weekends", x = "Time of Day", y = "Events") +
facet_wrap(~weekend)
After faceting on the new variable weekend, it becomes clearer that on weekdays, the rental behavior of Casual and Registered users is different: Registered users most frequently perform events around 8:00 AM and 6:00 PM, while Casual users most frequently perform events around 2:00 - 3:00 PM. On the weekends, the rental behavior of Casual and Registered users is similar: they both most frequently perform events around 2:00 - 4:00 PM.
client and fill with weekday. What information does this graph tell you that the previous didn’t? Is one graph better than the other?Trips %>%
mutate(day_of_week = weekdays(sdate), weekend = ifelse(day_of_week == "Sunday"| day_of_week == "Saturday", "Weekend", "Weekday")) %>%
mutate(hour_of_day = hour(sdate), minute_of_day = minute(sdate), time_of_day = hour_of_day + minute_of_day * 1/60) %>%
ggplot(aes(x = time_of_day, fill = weekend)) +
geom_density(alpha = .5, color = NA) +
labs(title = "Time of Day vs. Events for Weekdays and Weekends", x = "Time of Day", y = "Events") +
facet_wrap(~client)
After, faceting by ‘client’ and filling by ‘weekend’, I can see the the rental behavior of casual users is fairly similar for weekdays and weekends (most frequently perform events around 2:00 PM). I can also see from the plot on the right that the rental behavior of Registered users is different on weekdays than on weekends. On weekdays, Registered users most frequently perform events around 8:00 AM and 6:00 PM, while on weekends, they most frequently perform events around 2:00 - 3:00 PM. I believe graph is better than the previous one because you are able to tell that the rental behavior of Casual users stays relatively the same throughout the entire week, whereas Registered users show different rental behavior on weekdays than they do on weekends.
Stations to make a visualization of the total number of departures from each station in the Trips data. Use either color or size to show the variation in number of departures. We will improve this plot next week when we learn about maps!Stations %>%
left_join(Trips,
by = c("name" = "sstation")) %>%
group_by(name) %>%
mutate(total_departures = n()) %>%
ggplot(aes(x = long, y = lat, size = total_departures)) +
geom_jitter(alpha = 0.3) +
labs(title = "Total Departures from Each Rental Location (in spatial relation to each other) ", x = "Longitude", y ="Latitude")
Stations %>%
left_join(Trips,
by = c("name" = "sstation")) %>%
group_by(name, long, lat) %>%
summarize(percent_casual= mean(client == "Casual")) %>%
ggplot(aes(x = long, y = lat, color = percent_casual)) +
geom_point() +
labs(title = "Total Departures from Each Rental Location (in spatial relation to each other",
x = "Longitude",
y ="Latitute")
The areas with a much higher percentage of Casual users are almost all centrally located with respect to the other rental locations.
as_date(sdate) converts sdate from date-time format to date format.Stations %>%
left_join(Trips,
by = c("name" = "sstation")) %>%
mutate(just_date = as_date(sdate)) %>%
group_by(name, just_date) %>%
mutate(total_departures = n()) %>%
select(name, just_date, total_departures) %>%
ungroup(name, just_date) %>%
distinct() %>%
arrange(desc(total_departures)) %>%
slice_head(n = 10) -> Top_Ten_Most_Departures
Top_Ten_Most_Departures
Trips %>%
mutate(just_date = as_date(sdate)) -> Trips1
Top_Ten_Most_Departures %>%
left_join(Trips1,
by = c("name" = "sstation",
"just_date"))
Trips %>%
mutate(just_date = as_date(sdate), day_of_week = wday(sdate, label = TRUE)) -> Trips1
Top_Ten_Most_Departures %>%
left_join(Trips1,
by = c("name" = "sstation",
"just_date")) %>%
group_by(client, day_of_week) %>%
summarize(n = n()) %>%
mutate(proportion = n / sum(n)) %>%
pivot_wider(id_cols = day_of_week,
names_from = client,
values_from = proportion)
A large proportion of Casual users’ trips (~0.92) are made on the weekends, Saturday and Sunday. For Registered users, the proportion of trips are more evenly spread throughout each day of the week. Only a relatively small proportion of Registered users’ trips (~0.16) are made on the weekend.
DID YOU REMEMBER TO GO BACK AND CHANGE THIS SET OF EXERCISES TO THE LARGER DATASET? IF NOT, DO THAT NOW.
This problem uses the data from the Tidy Tuesday competition this week, kids. If you need to refresh your memory on the data, read about it here.
facet_geo(). The graphic won’t load below since it came from a location on my computer. So, you’ll have to reference the original html on the moodle page to see it.kids <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-15/kids.csv')
kids %>%
filter(year == c("1997", "2007")) %>%
mutate(inf_adj_perchild = inf_adj_perchild * 1000) %>%
pivot_wider(id_cols = state,
names_from = year,
names_prefix = "year_",
values_from = inf_adj_perchild) %>%
mutate(diff = year_2016 - year_1997) %>%
ggplot(aes(x = inf_adj_perchild)) +
geom_point() +
geom_segment() +
facet_geo(vars(state))
## Error: Problem with `mutate()` input `diff`.
## x object 'year_2016' not found
## ℹ Input `diff` is `year_2016 - year_1997`.
DID YOU REMEMBER TO UNCOMMENT THE OPTIONS AT THE TOP?